Speech recognition in reverberant and noisy environments employing multiple feature extractors and i-vector speaker adaptation
Authors
Abstract
The REVERB challenge provides a common framework for the evaluation of feature extraction techniques in the presence of both reverberation and additive background noise. State-of-the-art speech recognition systems perform well in controlled environments, but their performance degrades in realistic acoustical conditions, especially in real as well as simulated reverberant environments. In this contribution, we utilize multiple feature extractors, including the conventional mel-filterbank, multi-taper spectrum estimation-based mel-filterbank, robust mel and compressive gammachirp filterbank, iterative deconvolution-based dereverberated mel-filterbank, and maximum likelihood inverse filtering-based dereverberated mel-frequency cepstral coefficient features, for speech recognition with multi-condition training data. In order to improve speech recognition performance, we combine their results using ROVER (Recognizer Output Voting Error Reduction). For the two- and eight-channel tasks, to benefit from the multi-channel data, we also use ROVER, instead of a multi-microphone signal processing method, to reduce the word error rate by selecting the best-scoring word at each channel. As in previous work, we also apply i-vector-based speaker adaptation, which was found to be effective. In a speech recognition task, speaker adaptation aims to reduce the mismatch between the training and test speakers. Speech recognition experiments are conducted on the REVERB challenge 2014 corpora using the Kaldi recognizer. In our experiments, we use both utterance-based batch processing and full batch processing. In the single-channel task, full batch processing reduced the word error rate (WER) from 10.0 to 9.3 % on SimData compared with utterance-based batch processing. Using full batch processing, we obtained average WERs of 9.0 and 23.4 % on SimData and RealData, respectively, for the two-channel task, whereas for the eight-channel task the average WERs on SimData and RealData were 8.9 and 21.7 %, respectively.
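The multi-taper spectrum estimation-based mel-filterbank named in the abstract can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the frame length, number of DPSS tapers, FFT size, and filterbank size are assumed values. The idea is that averaging several orthogonally tapered periodograms gives a lower-variance spectrum estimate than a single-window periodogram, which tends to yield more robust filterbank features.

```python
# Sketch of a multi-taper mel-filterbank front-end (assumed parameters).
import numpy as np
from scipy.signal.windows import dpss

def mel_filterbank(n_filters, n_fft, sr, fmin=0.0, fmax=None):
    """Triangular mel filterbank matrix of shape (n_filters, n_fft // 2 + 1)."""
    fmax = fmax or sr / 2.0
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(mel(fmin), mel(fmax), n_filters + 2)
    bins = np.floor((n_fft + 1) * imel(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def multitaper_mel_features(frames, sr, n_tapers=6, n_fft=512, n_filters=23):
    """Log mel-filterbank energies from a multi-taper spectrum estimate.

    frames: (n_frames, frame_len) array of unwindowed speech frames.
    Each frame's spectrum is the average of n_tapers DPSS-tapered
    periodograms, a lower-variance estimate than a single-window one.
    """
    frame_len = frames.shape[1]
    tapers = dpss(frame_len, NW=(n_tapers + 1) / 2.0, Kmax=n_tapers)
    fb = mel_filterbank(n_filters, n_fft, sr)
    spec = np.zeros((frames.shape[0], n_fft // 2 + 1))
    for w in tapers:
        spec += np.abs(np.fft.rfft(frames * w, n=n_fft, axis=1)) ** 2
    spec /= n_tapers
    return np.log(spec @ fb.T + 1e-10)
```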
Similar articles
Robust Feature Extraction for Speech Recognition by Enhancing Auditory Spectrum
The goal of this work is to improve the robustness of speech recognition systems in additive noise and real-time reverberant environments. In this paper we present a compressive gammachirp filter-bank-based feature extractor that incorporates a method for the enhancement of auditory spectrum and a short-time feature normalization technique, which, by adjusting the scale and mean of cepstral feat...
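A minimal sketch of sliding-window (short-time) cepstral mean and variance normalization, the general kind of short-time feature normalization the snippet refers to; the window length is an assumed value and this is not necessarily the exact scheme used in that paper.

```python
# Short-time mean/variance normalization over a sliding window (assumed size).
import numpy as np

def short_time_mvn(feats, win=150):
    """Normalize each frame by the mean/std of a window centred on it.

    feats: (n_frames, n_dims) cepstral features; win: window size in frames.
    """
    out = np.empty_like(feats, dtype=float)
    half = win // 2
    for t in range(feats.shape[0]):
        seg = feats[max(0, t - half): t + half + 1]
        out[t] = (feats[t] - seg.mean(axis=0)) / (seg.std(axis=0) + 1e-10)
    return out
```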
Use of Multiple Front-ends and I-vector-based Speaker Adaptation for Robust Speech Recognition
Although state-of-the-art speech recognition systems perform well in controlled environments, they work poorly in realistic acoustical conditions in reverberant environments. Here, we use multiple front-ends (conventional mel-filterbank, multitaper spectrum estimation-based mel filterbank, robust mel and compressive gammachirp filterbank, iterative deconvolution-based dereverberated mel-filterba...
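The outputs of such multiple front-ends (or of multiple microphone channels) are combined with ROVER, as described in the abstract above. The toy sketch below only illustrates the voting step: real ROVER (Fiscus, 1997) first aligns the hypotheses into a word transition network via dynamic programming and can weight votes by confidence scores, whereas here the hypotheses are assumed to be already aligned slot-by-slot, with "<eps>" marking a deletion.

```python
# Toy ROVER-style majority voting over pre-aligned recognizer outputs.
from collections import Counter

def rover_vote(aligned_hyps):
    """Pick the most frequent word in each aligned slot across systems/channels."""
    combined = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word != "<eps>":          # drop slots where the majority vote is a deletion
            combined.append(word)
    return combined

# Hypothetical outputs from three front-ends (or microphone channels):
hyps = [
    ["the", "cat", "sat",   "on", "the", "mat"],
    ["the", "cap", "sat",   "on", "the", "mat"],
    ["the", "cat", "<eps>", "on", "the", "mat"],
]
print(" ".join(rover_vote(hyps)))    # -> "the cat sat on the mat"
```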
Recognition of Persian Language Accents from the Speech Signal Using Efficient Feature Extraction Methods and Classifier Combination
Speech recognition has achieved great improvements recently. However, robustness is still one of the big problems; for example, recognition performance fluctuates sharply depending on the speaker, especially when the speaker has a strong accent, and different accents dramatically decrease the accuracy of an ASR system. In this paper we apply three new methods of feature extraction including Spectral C...
Deep neural network-based bottleneck feature and denoising autoencoder-based dereverberation for distant-talking speaker identification
Deep neural network (DNN)-based approaches have been shown to be effective in many automatic speech recognition systems. However, few works have focused on DNNs for distant-talking speaker recognition. In this study, a bottleneck feature derived from a DNN and a cepstral domain denoising autoencoder (DAE)-based dereverberation are presented for distant-talking speaker identification, and a comb...
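A minimal sketch of a cepstral-domain denoising autoencoder for dereverberation in the spirit of the snippet above: it maps spliced reverberant cepstra to the corresponding clean centre frame. The layer sizes, splicing context, optimizer settings, and the stand-in tensors are assumed values, not the architecture reported in that paper.

```python
# Denoising autoencoder sketch for cepstral-domain dereverberation (assumed sizes).
import torch
import torch.nn as nn

class DereverbDAE(nn.Module):
    def __init__(self, dim=13, context=11, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * context, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, dim),          # predict the clean centre frame
        )

    def forward(self, x):
        return self.net(x)

# Training-loop sketch: `reverb` holds spliced reverberant cepstra and
# `clean` the parallel clean centre frames (both hypothetical tensors).
model = DereverbDAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
reverb = torch.randn(256, 13 * 11)   # stand-in batch of spliced frames
clean = torch.randn(256, 13)
for _ in range(10):
    opt.zero_grad()
    loss = loss_fn(model(reverb), clean)
    loss.backward()
    opt.step()
```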
On the role of missing data imputation and NMF feature enhancement in building synthetic voices using reverberant speech
In this paper, we study the role of a recently proposed feature enhancement technique in building HMM-based synthetic voices using reverberant speech data. The feature enhancement technique studied combines the advantages of missing data imputation and non-negative matrix factorization (NMF) based methods in cleaning up the reverberant features. Speaker adaptation of a clean average voice using...
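A small sketch of NMF-based feature enhancement in the spirit of the snippet above: a reverberant magnitude spectrogram V is decomposed against a basis W of clean-speech spectral atoms, and the reconstruction W @ H serves as the enhanced spectrogram. The assumption that W is pre-learned from clean speech, the basis size, and the iteration count are illustrative choices, and the updates shown are the standard Euclidean multiplicative rules rather than the exact algorithm of that paper.

```python
# NMF-based enhancement sketch: fit activations H against a fixed clean basis W.
import numpy as np

def nmf_enhance(V, W, n_iter=100, eps=1e-10):
    """V: (n_freq, n_frames) reverberant magnitude spectrogram.
    W: (n_freq, n_bases) basis assumed to be learned from clean speech."""
    H = np.abs(np.random.rand(W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # multiplicative update for H
    return W @ H                                # enhanced spectrogram estimate
```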
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
Journal: EURASIP J. Adv. Sig. Proc.
Volume: 2015, Issue: -
Pages: -
Publication year: 2015